Concordance of three approaches for operationalizing outcome definitions for multidrug-resistant TB

BACKGROUND: The WHO provides standardized outcome definitions for rifampicin-resistant (RR) and multidrug-resistant (MDR) TB. However, operationalizing these definitions can be challenging in some clinical settings, and incorrect classification may generate bias in reporting and research. Outcomes calculated by algorithms can increase standardization and be adapted to suit the research question. We evaluated concordance between clinician-assigned treatment outcomes and outcomes calculated based on one of two standardized algorithms, one which identified failure at its earliest possible occurrence (i.e., failure-dominant algorithm), and one which calculated the outcome based on culture results at the end of treatment, regardless of early occurrence of failure (i.e., success-dominant algorithm). METHODS: Among 2,525 patients enrolled in the multi-country endTB observational study, we calculated the frequencies of concordance using cross-tabulations of clinician-assigned and algorithm-assigned outcomes. We summarized the common discrepancies. RESULTS: Treatment success calculated by algorithms had high concordance with treatment success assigned by clinicians (95.8% and 97.7% for failure-dominant and success-dominant algorithms, respectively). The frequency and pattern of the most common discrepancies varied by country. CONCLUSION: High concordance was found between clinician-assigned and algorithm-assigned outcomes. Heterogeneity in discrepancies across settings suggests that using algorithms to calculate outcomes may minimize bias.


KEY WORDS: drug-resistant tuberculosis; rifampin-resistant tuberculosis; treatment outcome; definition

Evaluating end-of-treatment (EOT) outcomes for rifampicin-resistant (RR) and multidrug-resistant (MDR) TB is important for clinical care and programmatic monitoring. To facilitate outcome evaluation for RR/MDR-TB, the WHO provides standardized outcome definitions [1]. Observational research plays an important role in generating evidence to guide RR/MDR-TB treatment recommendations. Standardized application of outcome definitions when analyzing RR/MDR-TB cohorts is one step, among many, that investigators may take to generate valid, comparable results across studies [2-5]. In practice, however, applying standardized definitions, which include the consideration of both quantity and timing of culture results, may be challenging [1,6]. In clinical settings, consistent outcome assignment may be complicated by differences in treatment duration; availability of microbiological results (relative to the timing of cessation of treatment); availability and use of clinical and radiographic data; adherence to treatment; and information on reasons for treatment change [6]. To enhance standardization, algorithms can be used to calculate outcomes based on pre-specified criteria. Algorithms can also be adapted for different purposes, such as evaluating the effectiveness of an initial regimen without considering the effects of subsequent treatment adjustments or the effectiveness of overall treatment strategies after considering such effects. However, using these algorithms involves numerous decision points guided by careful consideration of the research question.
In this study, we applied WHO outcome definitions to a cohort of patients treated for RR/MDR-TB. We programmed two algorithms, each for a distinct research purpose, and compared algorithm-assigned outcomes to clinician-assigned outcomes, with the overall goal of understanding the extent to which different operationalizations of the same definitions may lead to diverging outcome assignment and potential bias in observational RR/MDR-TB treatment cohorts.

METHODS
Data resource, study design, and participants
Data were derived from the endTB observational study, a prospective cohort of patients treated for RR/MDR-TB with an individualized, longer regimen containing bedaquiline and/or delamanid, composed according to WHO and local guidance, in one of 17 countries [7]. A total of 2,789 patients were recruited and enrolled between April 2015 and September 2018. Routine clinical and laboratory data were entered into an electronic medical record. For this analysis, we excluded patients treated in the Democratic People's Republic of Korea (n = 155; the only participating country that a priori implemented shortened regimens for RR/MDR-TB), those for whom there was no recorded clinician-assigned outcome (n = 82), and those with an outcome of not evaluated, transferred, or "treatment adapted" (n = 27), leaving 2,525 patients for analysis.

WHO outcome definitions
Outcome assignments were based on the 2013 WHO definitions in place during the study period [1]. Under these definitions, an outcome of cure was assigned if patients completed treatment with three or more consecutive negative cultures taken at least 30 days apart after the intensive phase, or after 8 months if there was no intensive phase. Patients whose treatment outcome did not meet the definition of cure but in whom there was no evidence of treatment failure were assigned an outcome of treatment completed. Treatment failure was defined as treatment termination or need for a permanent regimen change of two or more anti-TB drugs due to any of the following reasons: lack of culture conversion by the end of the intensive phase; bacteriological reversion in the continuation phase; detection of acquired resistance to fluoroquinolones or second-line injectable drugs; or adverse drug reactions. Death was defined as death from any cause during treatment. Loss to follow-up (LTFU) was defined as a treatment interruption of 2 or more consecutive months.
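The culture component of the cure definition above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function name, the input format (a list of `(date, is_positive)` pairs collected after the intensive phase), and the handling of closely spaced cultures are all assumptions. In particular, the sketch assumes that a negative culture taken fewer than 30 days after the previously counted negative is skipped rather than counted, and that any positive culture resets the run. A patient would also need to complete treatment without evidence of failure to be classified as cured; that part is not modeled here.

```python
from datetime import date


def meets_cure_culture_criterion(cultures, min_run=3, min_gap_days=30):
    """Return True if the cultures contain >= min_run consecutive negatives,
    each taken at least min_gap_days after the previously counted negative.

    cultures: iterable of (date, is_positive) pairs taken after the intensive
    phase (hypothetical input format, assumed for this sketch).
    """
    run = 0
    last_counted = None
    for when, is_positive in sorted(cultures):
        if is_positive:
            # Assumption: a positive culture breaks the run of negatives.
            run, last_counted = 0, None
        elif last_counted is None or (when - last_counted).days >= min_gap_days:
            run += 1
            last_counted = when
        # Assumption: a negative taken too soon is ignored, neither
        # extending nor breaking the run.
        if run >= min_run:
            return True
    return False


# Hypothetical patient with three negatives spaced more than 30 days apart.
example = [
    (date(2016, 1, 1), False),
    (date(2016, 2, 1), False),
    (date(2016, 3, 5), False),
]
print(meets_cure_culture_criterion(example))
```

Other readings of "at least 30 days apart" are possible (for example, measuring the gap between the first and last culture of the run), which is one of the decision points an algorithmic approach has to make explicit.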

Operationalization of WHO outcome definitions
Clinician-assigned outcomes
As part of routine care, clinicians assigned a treatment outcome to each patient based on available laboratory results and clinical information, in accordance with WHO outcome definitions [1].

Algorithm-assigned outcomes
We created two algorithms based on WHO outcome definitions. The first was designed to establish failure at its earliest possible occurrence (henceforth, the "failure-dominant algorithm"). The second assigned outcomes based on culture results available at the end of treatment, regardless of whether the patient's experience initially met the definition for early treatment failure (henceforth, the "success-dominant algorithm"). Outcomes generated under the success-dominant algorithm reflected treatment response to the initial regimen and any subsequent regimen changes (i.e., those occurring in response to early treatment failure). For example, a patient who underwent early bacteriological reversion from culture-negative to positive, but whose experience ultimately met the definition for cure at the end of treatment, would be assigned an outcome of failure under the failure-dominant algorithm and an outcome of cure under the success-dominant algorithm.
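The contrast between the two algorithms can be illustrated with a deliberately simplified Python sketch. The function names and the two inputs (`early_failure`, a flag for whether the failure definition was met earlier in treatment, and `eot_outcome`, the outcome implied by cultures at the end of treatment) are hypothetical and introduced only for illustration; death and LTFU are not modeled.

```python
def failure_dominant(early_failure: bool, eot_outcome: str) -> str:
    """Failure, once met, takes precedence over the end-of-treatment result."""
    return "failure" if early_failure else eot_outcome


def success_dominant(early_failure: bool, eot_outcome: str) -> str:
    """The end-of-treatment result stands, whatever happened earlier."""
    return eot_outcome


# The example patient from the text: early bacteriological reversion,
# but the definition of cure is met at the end of treatment.
patient = {"early_failure": True, "eot_outcome": "cure"}
print(failure_dominant(**patient))
print(success_dominant(**patient))
```

For patients without early failure, the two functions return the same outcome, so any divergence between the algorithms is confined to patients whose regimen was changed or whose cultures reverted before the end of treatment.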
Outcomes under each algorithm were calculated as follows. The failure-dominant algorithm (Supplementary Figure S1) identified the earliest date of treatment failure after 8 months of treatment (as a proxy for the intensive phase). Patients who did not experience treatment failure and were not assigned an outcome of death remained eligible for an outcome of success, which was calculated as in the success-dominant algorithm. In the success-dominant algorithm (Supplementary Figure S2), patients were assigned an outcome at the end of treatment based on the available culture results. The longer regimens used were intended to last 18-20 months; in some instances, however, clinicians stopped treatment before 18 months. To reflect the intent that all included treatments were longer regimens, we imposed a minimum treatment duration of 15 months for an outcome assignment of cure or treatment completion [8]. Patients who completed 15 months of treatment were assigned an outcome based on WHO outcome definitions, regardless of whether they became LTFU after this period. An outcome of death was assigned when it occurred. Patients who became LTFU before completing 15 months of treatment were assigned LTFU. Patients who were treated for less than 15 months without indication of LTFU or death were assigned an outcome as follows: 1) treatment failure if patients had at least two cultures after 8 months of treatment and at least one of the following was true: more than one of the last three cultures was positive or the final culture was positive; 2) "<15 months, favorable" if patients had a maximum of one positive culture and the final culture was negative; 3) "<15 months, unfavorable" if there were fewer than two cultures after 8 months of treatment and the criteria for "<15 months, favorable" were not met.
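The three rules for patients treated for less than 15 months (without LTFU or death) can be written as a short Python sketch. It is an illustration under stated assumptions rather than the study's implementation, which is given in the supplementary figures: the function name and input format are hypothetical, the "maximum of one positive culture" in rule 2 is assumed to refer to the last three cultures after month 8, and a patient with no cultures at all falls through to "<15 months, unfavorable" here, which may not match how the study handled that case.

```python
def classify_short_treatment(cultures_after_month_8):
    """Classify a patient treated <15 months with no LTFU or death.

    cultures_after_month_8: chronological list of booleans,
    True = positive culture (hypothetical input format).
    """
    cultures = list(cultures_after_month_8)
    positives_in_last_three = sum(cultures[-3:])
    final_positive = bool(cultures) and cultures[-1]

    # Rule 1: at least two cultures, and either more than one of the
    # last three is positive or the final culture is positive.
    if len(cultures) >= 2 and (positives_in_last_three > 1 or final_positive):
        return "failure"
    # Rule 2: at most one positive culture and a negative final culture.
    # Assumption: positives are counted among the last three cultures.
    if cultures and positives_in_last_three <= 1 and not final_positive:
        return "<15 months, favorable"
    # Rule 3: everything else, including (by assumption) no cultures.
    return "<15 months, unfavorable"


print(classify_short_treatment([True, False, False]))
print(classify_short_treatment([True, True, False]))
print(classify_short_treatment([True]))
```

Writing the rules as ordered branches makes the precedence explicit: failure is checked first, so a patient can only reach the favorable or unfavorable categories once the failure criteria have been ruled out.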

Statistical analysis
We calculated the frequencies of patients with each outcome across the three approaches to operationalizing WHO outcome definitions. We also calculated the frequencies of concordance between algorithm-assigned and clinician-assigned outcomes for each individual outcome and for the broader dichotomous category of treatment success versus no success. To evaluate the overall concordance between clinician-assigned and algorithm-assigned outcomes, we calculated the simple κ coefficient and 95% confidence intervals (CIs) for the dichotomous category of treatment success. We summarized the common discrepancies.
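As a rough illustration of the concordance analysis, the sketch below cross-tabulates two sets of dichotomous outcomes and computes the simple (Cohen's) κ coefficient as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from the marginal totals. The function names are hypothetical, the counts are invented for demonstration and are not study data, and the confidence interval is omitted because the variance formula used by the authors' software is not stated in the text.

```python
from collections import Counter


def cross_tabulate(clinician, algorithm):
    """Count patients in each (clinician outcome, algorithm outcome) cell."""
    return Counter(zip(clinician, algorithm))


def cohen_kappa(clinician, algorithm):
    """Simple kappa: chance-corrected agreement between two classifications."""
    n = len(clinician)
    table = cross_tabulate(clinician, algorithm)
    categories = set(clinician) | set(algorithm)
    observed = sum(table[(c, c)] for c in categories) / n
    row_totals, col_totals = Counter(clinician), Counter(algorithm)
    expected = sum(row_totals[c] * col_totals[c] for c in categories) / n**2
    # Undefined (division by zero) if both raters use a single category.
    return (observed - expected) / (1 - expected)


# Invented counts for illustration only (NOT study data):
# 40 success/success, 10 success/no success, 5 no success/success,
# 45 no success/no success.
clinician = ["success"] * 50 + ["no success"] * 50
algorithm = (["success"] * 40 + ["no success"] * 10
             + ["success"] * 5 + ["no success"] * 45)

print(cross_tabulate(clinician, algorithm))
print(round(cohen_kappa(clinician, algorithm), 3))
```

With these invented counts the observed agreement is 0.85 and the chance agreement is 0.50, giving κ = 0.70. The same functions work unchanged for the full set of individual outcome categories.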

Research ethics
The endTB observational study protocol was approved by all study countries (Armenia, Bangladesh, Belarus, Ethiopia, Georgia, Haiti, Indonesia, Kazakhstan, Kenya, Kyrgyzstan, Lesotho, Myanmar, Pakistan, Peru, South Africa, Vietnam) and central ethics review committees for each consortium partner. Patients provided written informed consent for inclusion in the observational cohort.

RESULTS
Discrepancies between clinician-assigned and algorithm-assigned outcomes
Most discrepancies in individual outcomes (216/323 [66.9%] for the failure-dominant algorithm and 210/269 [78.1%] for the success-dominant algorithm) did not affect whether the outcome was classified as treatment success. Of the discrepancies identified, there were two common patterns: one involved a clinician-assigned outcome of treatment completed which was classified as cured by the algorithms; the other involved a clinician-assigned outcome of cured which was classified as "<15 months, favorable" by the algorithms. A total of 107/323 (33.1%) and 59/269 (21.9%) discrepancies between clinician-assigned outcomes and the failure-dominant and success-dominant algorithms, respectively, affected whether the outcome was classified as successful. Among the discrepancies between clinician-assigned outcomes and those calculated with the failure-dominant algorithm, 65 (60.8%) patients were assigned an outcome indicative of treatment success by the clinician but were classified as no success by the failure-dominant algorithm (Figure 1). Most of these discrepancies (n = 59, 90.8%) occurred when a clinician assigned an outcome of cured or completed in the presence of at least two positive cultures, or after two new drugs had been added to the initial regimen, by the end of 8 months of treatment. The remaining 42 of 107 patients (39.2%) were assigned an outcome of unsuccessful treatment by the clinician but were classified as having successful treatment by the algorithm.

For discrepancies between clinician-assigned outcomes and those calculated with the success-dominant algorithm (Figure 2), the prevailing pattern, which accounted for 86.4% (51/59) of the discrepancies affecting success classification, was one in which clinicians assigned an outcome of unsuccessful treatment while the algorithm assigned an outcome of successful treatment. Of these 51 discrepancies, 32 involved a clinician-assigned outcome of failure and 19 an outcome of LTFU. Most (n = 48, 94.1%) of these outcomes were classified as treatment success by the algorithm because the final cultures were negative.

Heterogeneity in the frequency and type of discrepancies by country
Country-specific sample sizes ranged from 5 to 671 (Table 3). The frequency of discrepancies between clinician-assigned outcomes and those from each algorithm varied across countries. For example, in Country 1, discrepancies typically involved a clinician-assigned outcome of unsuccessful treatment and an algorithm-assigned outcome of successful treatment (8/10 [80%] and 9/11 [81.8%] for failure-dominant and success-dominant algorithms, respectively). In Country 8, discrepancies between clinician-assigned outcomes and those from the failure-dominant algorithm tended to go both ways (i.e., 19/31 [61.3%] of discrepancies involved a clinician-assigned outcome of treatment success and algorithm-assigned outcome of no success, while 12/31 [38.7%] of discrepancies involved a clinician-assigned outcome indicative of no treatment success and algorithm-assigned outcome of success). Comparing clinician-assigned outcomes to those derived from the success-dominant algorithm in Country 8 revealed 17 discrepancies, all of which were clinician-assigned outcomes of no treatment success and algorithm-assigned outcomes of success.

DISCUSSION
Outcomes derived using each of two algorithms had high concordance with clinician-assigned outcomes, suggesting that all three approaches generally yielded outcomes that reflect WHO outcome definitions. Although relatively rare, we identified patterns of discrepancies between clinician-assigned and algorithm-assigned outcomes that varied in frequency across settings. For example, the percentage of discrepancies between clinician-assigned and failure-dominant algorithm-assigned outcomes ranged from 0% to 20%. While small sample sizes in some countries likely contributed to variability, heterogeneity could also be due to differences in training, experience, and the flexibility with which outcome definitions are applied by clinicians, all of which may differ by setting, leading to varying degrees of bias [9]. Standardized algorithms that calculate outcomes based on available data may reduce human error and heterogeneity in outcome assignment, and thereby reduce bias.
We also found that the direction of discordance, such as treatment success by clinicians versus no success by algorithms (or vice versa), differed across settings. For example, in some countries, discrepancies occurred in only one direction, while in others, they occurred in both directions. Although the latter may not necessarily bias the absolute proportion, it will cause bias in risk factor and comparative effectiveness analyses or individual patient data meta-analyses if the approaches used for outcome assignment differ by setting. Regimens are also often clustered by setting based on the national TB program guidelines for regimen composition; if some countries or clinicians apply more or less stringent criteria (e.g., the minimum duration of treatment used to define cure) for outcome classification, bias may occur and preclude correct conclusions regarding relative treatment effectiveness. Operationalizing WHO outcome definitions using a consistent approach across settings and reporting the approach used for outcome assignment could reduce such bias.
The approach used for outcome assignment should align with the intended purpose [6,10-12]. Calculated outcomes may reduce heterogeneity across time and place while allowing flexibility to contend with changing outcome definitions and address specific research objectives [10-12]. For example, the failure-dominant algorithm may be most appropriate for evaluating the effectiveness of an initial regimen because treatment failure is defined as soon as it is identified at the end of 8 months. In this way, the failure-dominant algorithm evaluates treatment response to the initial regimen, but not the combined response of the initial regimen plus subsequent regimen changes made due to treatment failure. Alternatively, the success-dominant algorithm designates an outcome at the end of treatment, regardless of evidence of early treatment failure and subsequent regimen changes; this definition may be most appropriate for evaluating the effectiveness of overall treatment and management strategies, as opposed to the effectiveness of an individual regimen. The TRUNCATE-TB trial provides an illustrative example of assessing a "strategy". Study patients who experience relapse after the 8-week shortened regimen will subsequently be treated with a standard 24-week regimen [13]. The primary efficacy endpoint, assessed at 96 weeks after randomization, represents a combined strategy for treatment and management of any subsequent relapse. Our success-dominant algorithm aligns with this approach in that it considers the overall treatment experience, including the response to the initial regimen and the management of early treatment failure.

Figure 2 Discrepancies between clinician-assigned outcomes and success-dominant-assigned outcomes. *Denominator = total number of patients with discrepancies (n = 59). †Denominator = number of patients with change from success to no success (n = 8). ‡Denominator = number of patients with change from no success to success (n = 51). §Patients were treated <15 months and had no culture results; assigned an outcome of "<15 months, favorable".

Although algorithms could improve outcome classification by removing bias and variability, clinician-assigned outcomes may sometimes most accurately capture the intent of definitions (e.g., if there were laboratory errors that contributed to a final positive culture in the absence of clinical symptoms). This scenario represents an inherent limit to algorithms; these rely on available data and could be biased if factors considered by clinicians are not available to the analysis team. A second example would be if a culture result was available in the clinical chart but not entered into the database. Likewise, algorithms are less informed than clinician-assigned outcomes if culture results are missing. Missingness of culture results could be due to reagent stock-outs, limited laboratory capacity, or a patient's inability to produce an adequate sputum sample (especially during the later stages of treatment) [6]. For patients who were unable to produce sputum samples, missing cultures could be considered negative [12]. Recording the reasons for missing sputum samples or culture

Evidence from this study supports the use of algorithms to calculate EOT outcomes for RR/MDR-TB research cohorts. Doing so may be particularly important in multisite studies, or those with follow-up periods encompassing different outcome definitions, in order to reduce heterogeneity across countries and among clinicians and to minimize bias.